Abstract

There has been a lot of recent work on Bayesian methods for reinforcement 
learning exhibiting near-optimal online performance. The main obstacle facing 
such methods is that in most problems of interest, the optimal solution involves 
planning in an infinitely large tree. However, it is possible to obtain stochastic 
lower and upper bounds on the value of each tree node. This enables us to use 
stochastic branch and bound algorithms to search the tree efficiently. This paper 
proposes two such algorithms and examines their complexity in this setting. 



1 Introduction 



Various Bayesian methods for exploration in Markov decision processes (MDPs) and
for solving known partially-observable Markov decision processes (POMDPs) were
proposed previously (c.f. [Poupart et al., 2006, Duff, 2002, Ross et al., 2008]).
However, such methods often suffer from computational tractability problems. Optimal
Bayesian exploration requires the creation of an augmented MDP model in the form of
a tree [Duff, 2002], where the root node is the current belief-state pair and children are
all possible subsequent belief-state pairs. The size of the belief tree increases
exponentially with the horizon, while the branching factor is infinite in the case of
continuous observations or actions.

In this work, we examine the complexity of efficient algorithms for expanding the
tree. In particular, we propose and analyse stochastic search methods similar to the
ones proposed in [Bubeck et al., 2008, Norkin et al., 1998]. Related methods have been
previously examined experimentally in the context of Bayesian reinforcement learning
in [Dimitrakakis, 2008, Wang et al., 2005].

The remainder of this section summarises the Bayesian planning framework. Our
main results are presented in Sect. 2. Section 3 concludes with a discussion of related
work. Technical proofs and related results are presented in the Appendix.



1.1 Markov Decision Processes 

Reinforcement learning [c.f. Puterman, 1994, 2005] is a discrete-time sequential
decision making problem, where we wish to act so as to maximise the expected sum of
discounted future rewards $E \sum_{k=1}^{T-t} \gamma^k r_{t+k}$, where $r_t \in \mathcal{R}$ is a stochastic reward at time $t$.
We are only interested in rewards from time $t$ to $T > 0$, and $\gamma \in [0, 1]$ plays the role of
a discount factor. Typically, we assume that $\gamma$ and $T$ are known (or have a known prior
distribution) and that the sequence of rewards arises from a Markov decision process
$\mu$:

Definition 1 (MDP) A Markov decision process is a discrete-time stochastic process
with: a state $s_t \in \mathcal{S}$ at time $t$ and a reward $r_t \in \mathcal{R}$, generated by the process $\mu$, and an
action $a_t \in \mathcal{A}$, chosen by the decision maker. We denote the distribution over next states
$s_{t+1}$, which only depends on $s_t$ and $a_t$, by $\mu(s_{t+1} | s_t, a_t)$. Furthermore, $\mu(r_{t+1} | s_t, a_t)$ is
a reward distribution conditioned on states and actions. Finally, $\mu(r_{t+1}, s_{t+1} | s_t, a_t) =
\mu(r_{t+1} | s_t, a_t) \mu(s_{t+1} | s_t, a_t)$.

In the above, and throughout the text, we usually take $\mu(\cdot)$ to mean $P_\mu(\cdot)$, the
distribution under the process $\mu$, for compactness. Frequently such a notation will imply a
marginalisation. For example, we shall write $\mu(s_{t+k} | s_t, a_t)$ to mean:

$$\sum_{s_{t+1}, \ldots, s_{t+k-1}} \mu(s_{t+k}, \ldots, s_{t+1} | s_t, a_t).$$

The decision maker takes actions according to a policy $\pi$, which defines a distribution
$\pi(a_t | s_t)$ over $\mathcal{A}$ conditioned on the state $s_t$, i.e. a set of probability measures over $\mathcal{A}$
indexed by $s_t$. A policy $\pi$ is stationary if $\pi(a_t = a | s_t = s) = \pi(a_{t'} = a | s_{t'} = s)$ for all
$t, t'$. The expected utility of a policy $\pi$ selecting actions in the MDP $\mu$, from time $t$ to $T$,
can be written as the value function:

$$V^{\pi}_{\mu,t,T}(s) = E_{\pi,\mu}\left(\sum_{k=1}^{T-t} \gamma^k r_{t+k} \,\middle|\, s_t = s\right), \qquad (1)$$

where $E_{\pi,\mu}$ denotes the expectation under the Markov chain arising from acting with policy
$\pi$ on the MDP $\mu$. Whenever it is clear from context, superscripts and subscripts shall be
omitted for brevity. The optimal value function will be denoted by $V^* \triangleq \max_\pi V^\pi$. If the
MDP is known, we can evaluate the optimal value function and policy in time polynomial
in the sizes of the state and action sets [Puterman, 1994, 2005] via backwards induction
(value iteration).
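As an illustration of backwards induction on a known MDP, the following is a minimal Python sketch, not code from the paper. The nested-list containers `P[s][a][s2]` and `R[s][a]` and the function name are assumptions made for this example, and rewards are attached to the current state-action pair, which differs slightly from the indexing convention of the text. Each sweep costs $O(|\mathcal{S}|^2 |\mathcal{A}|)$ operations, which is the polynomial dependence mentioned above.

```python
def backwards_induction(P, R, gamma, horizon):
    """Finite-horizon backwards induction for a known tabular MDP.

    P[s][a][s2] is a transition probability and R[s][a] an expected reward
    (hypothetical containers chosen for this sketch).
    Returns the value of each state with `horizon` steps to go and the
    greedy first-step action for each state.
    """
    n_states = len(P)
    n_actions = len(P[0])
    V = [0.0] * n_states          # value with zero steps to go
    policy = [0] * n_states
    for _ in range(horizon):
        V_new = [0.0] * n_states
        for s in range(n_states):
            # One-step lookahead for every action in state s.
            q = [R[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                       for s2 in range(n_states))
                 for a in range(n_actions)]
            policy[s] = max(range(n_actions), key=q.__getitem__)
            V_new[s] = q[policy[s]]
        V = V_new
    return V, policy
```

For example, in a two-state MDP where action 1 always pays reward 1, two steps with $\gamma = 0.5$ give a value of 1.5 in every state.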

1.2 Bayesian Reinforcement Learning 

If the MDP is unknown, we may use a Bayesian framework to represent our
uncertainty [Duff, 2002]. This requires maintaining a belief about which MDP $\mu \in \mathcal{M}$
corresponds to reality. More precisely, we define a measurable space $(\mathcal{M}, \mathfrak{M})$, where
$\mathcal{M}$ is a (usually uncountable) set of MDPs, and $\mathfrak{M}$ is a suitable $\sigma$-algebra. With an
appropriate initial density $\xi_0(\mu)$, we can obtain a sequence of densities $\xi_t(\mu)$, representing
our subjective belief at time $t$, by conditioning $\xi_t(\mu)$ on the latest observations:

$$\xi_{t+1}(\mu) \triangleq \frac{\mu(r_{t+1}, s_{t+1} | s_t, a_t)\, \xi_t(\mu)}{\int_{\mathcal{M}} \mu'(r_{t+1}, s_{t+1} | s_t, a_t)\, \xi_t(\mu')\, d\mu'}. \qquad (2)$$

In the following, we write $E_\xi$ to denote expectations with respect to any belief $\xi$.
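For discrete MDPs with Dirichlet beliefs over transitions (the case mentioned in footnote 2), the conditioning step reduces to incrementing a count. The sketch below illustrates this under that assumption; the class name, method names and the uniform `prior` pseudo-count are hypothetical choices for the example, and rewards are left out for brevity.

```python
from collections import defaultdict


class DirichletBelief:
    """Independent Dirichlet beliefs over next-state distributions,
    one per (state, action) pair. `prior` is the pseudo-count assigned
    to every next state before any observation (an assumption)."""

    def __init__(self, n_states, prior=1.0):
        self.n_states = n_states
        self.counts = defaultdict(lambda: [prior] * n_states)

    def update(self, s, a, s_next):
        """Condition the belief on one observed transition.
        For a Dirichlet density this is the conjugate count update."""
        self.counts[(s, a)][s_next] += 1.0

    def expected_transition(self, s, a):
        """Posterior expectation of the transition probabilities."""
        psi = self.counts[(s, a)]
        total = sum(psi)
        return [c / total for c in psi]
```

The update mutates the belief in place; a tree search would instead copy the counts for each child hyper-state.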

1.3 Belief-Augmented MDPs 

In order to optimally select actions in this framework, it is necessary to explicitly take
into account future changes in the belief when planning [Duff, 2002]. The idea is to
combine the original MDP's state $s_t$ and our belief state $\xi_t$ into a hyper-state.

Definition 2 (BAMDP) A Belief-Augmented MDP $\nu$ (BAMDP) is an MDP with a set
of hyper-states $\Omega \triangleq \mathcal{S} \times \mathcal{B}$, where $\mathcal{B}$ is an appropriate set of probability measures on
$\mathcal{M}$, and $\mathcal{S}$, $\mathcal{A}$ are the state and action sets of all $\mu \in \mathcal{M}$. At time $t$, the agent observes
the hyper-state $\omega_t = (s_t, \xi_t) \in \Omega$ and takes action $a_t \in \mathcal{A}$. We write the transition
distribution as $\nu(\omega_{t+1} | \omega_t, a_t)$ and the reward distribution as $\nu(r_t | \omega_t)$.

The hyper-state $\omega_t$ has the Markov property. This allows us to treat the BAMDP as
an infinite-state MDP with transitions $\nu(\omega_{t+1} | \omega_t, a_t)$, and rewards $\nu(r_t | \omega_t)$.$^1$ When
the horizon $T$ is finite, we need only expand the tree to depth $T - t$. Thus,
backwards induction starting from the set of terminal hyper-states $\Omega_T$ and proceeding
backwards to $T - 1, \ldots, t$ provides a solution:

$$V^*_n(\omega) = \max_{a \in \mathcal{A}} E_\nu(r | \omega) + \gamma \sum_{\omega' \in \Omega_{n+1}} \nu(\omega' | \omega, a) V^*_{n+1}(\omega'), \qquad (3)$$

where $\Omega_n$ is the set of hyper-states at time $n$. We can approximately solve
infinite-horizon problems if we expand the tree to some finite depth, provided we have bounds on the
value of leaf nodes.
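To make the belief-tree recursion concrete, here is a small sketch for the simplest special case: a Bernoulli bandit with independent Beta beliefs, where the hyper-state is just the tuple of Beta parameters. The function name, the reward of 1 on success, and the leaf value of 0 (a trivial lower bound for rewards in $[0, 1]$) are assumptions of this example, not definitions from the text.

```python
from functools import lru_cache


def bandit_bamdp_value(belief, depth, gamma=1.0):
    """Depth-limited backwards induction on the belief tree of a
    Bernoulli bandit. `belief` is a tuple of (alpha, beta) Beta
    parameters, one pair per arm; leaves are given value 0."""

    @lru_cache(maxsize=None)
    def value(b, d):
        if d == 0:
            return 0.0
        best = float("-inf")
        for arm, (alpha, beta) in enumerate(b):
            p = alpha / (alpha + beta)  # predictive success probability
            # Child hyper-states after observing a success or a failure.
            win = b[:arm] + ((alpha + 1, beta),) + b[arm + 1:]
            lose = b[:arm] + ((alpha, beta + 1),) + b[arm + 1:]
            q = (p * (1.0 + gamma * value(win, d - 1))
                 + (1.0 - p) * gamma * value(lose, d - 1))
            best = max(best, q)
        return best

    return value(tuple(belief), depth)
```

With two arms and uniform priors, a depth of 2 already shows the value of information: the result is 13/12, more than twice the myopic one-step value of 1/2 would suggest only if the first observation is used to choose the second arm.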



1.4 Bounds on the Value Function 

We shall relate the optimal value function of the BAMDP, $V^*(\omega)$, for some $\omega = (s, \xi)$,
to the value functions $V^\pi_\mu$ of MDPs $\mu \in \mathcal{M}$ for some $\pi$. The optimal policy for $\mu$ is
denoted as $\pi^*(\mu)$. The mean MDP resulting from belief $\xi$ is denoted as $\bar\mu_\xi$ and has the
properties: $\bar\mu_\xi(s_{t+1} | s_t, a_t) = E_\xi[\mu(s_{t+1} | s_t, a_t)]$, $\bar\mu_\xi(r_{t+1} | s_t, a_t) = E_\xi[\mu(r_{t+1} | s_t, a_t)]$.

Proposition 1 [Dimitrakakis, 2008] For any $\omega = (s, \xi)$, the BAMDP value function $V^*$
obeys:

$$\int_{\mathcal{M}} V^{\pi^*(\mu)}_\mu(s)\, \xi(\mu)\, d\mu \ge V^*(\omega) \ge \int_{\mathcal{M}} V^{\pi^*(\bar\mu_\xi)}_\mu(s)\, \xi(\mu)\, d\mu \qquad (4)$$

Proof By definition, $V^*(\omega) \ge V^\pi(\omega)$ for all $\omega$, for any policy $\pi$. It is easy to see that
the lower bound equals $V^{\pi^*(\bar\mu_\xi)}(\omega)$, thus proving the right hand side. The upper bound
follows from the fact that for any function $f$, $\max_x \int f(x, u)\, du \le \int \max_x f(x, u)\, du$. ■

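If $\mathcal{M}$ is not finite, then we cannot calculate the upper bound of $V^*(\omega)$ in closed form.
However, we can use Monte Carlo sampling: given a hyper-state $\omega = (s, \xi)$, we draw $m$
MDPs from its belief $\xi$: $\mu_1, \ldots, \mu_m \sim \xi$,$^2$ estimate the value function for each $\mu_k$,
$\tilde v^\omega_{U,k} \triangleq V^{\pi^*(\mu_k)}_{\mu_k}(s)$, and average the samples: $\hat v^\omega_{U,m} \triangleq \frac{1}{m} \sum_{k=1}^m \tilde v^\omega_{U,k}$. Let $\bar v^\omega_U \triangleq \int_{\mathcal{M}} \xi(\mu) V^*_\mu(s)\, d\mu$.
Then $\lim_{m \to \infty} \hat v^\omega_{U,m} = \bar v^\omega_U$ almost surely and $E[\hat v^\omega_{U,m}] = \bar v^\omega_U$.

Lower bounds can be calculated via a similar procedure. We begin by calculating
the optimal policy $\pi^*(\bar\mu_\xi)$ for the mean MDP $\bar\mu_\xi$ arising from $\xi$. We then compute
$\tilde v^\omega_{L,k} \triangleq V^{\pi^*(\bar\mu_\xi)}_{\mu_k}(s)$, the value of that policy for each sample $\mu_k$, and estimate
$\hat v^\omega_{L,m} \triangleq \frac{1}{m} \sum_{k=1}^m \tilde v^\omega_{L,k}$.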
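The two sampling procedures can be sketched as follows for a small discrete problem. This is an illustrative sketch under stated assumptions rather than the paper's implementation: the belief is a table of Dirichlet parameters `counts[s][a]`, rewards `R[s][a]` are assumed known and bounded in $[0, 1]$, values are infinite-horizon discounted values approximated by a fixed number of sweeps, and all function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet via Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def q_value(P, R, gamma, V, s, a):
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def value_iteration(P, R, gamma, iters=300):
    """Approximate optimal values and a greedy policy for a known MDP."""
    n_s, n_a = len(P), len(P[0])
    V = [0.0] * n_s
    for _ in range(iters):
        V = [max(q_value(P, R, gamma, V, s, a) for a in range(n_a))
             for s in range(n_s)]
    policy = [max(range(n_a), key=lambda a: q_value(P, R, gamma, V, s, a))
              for s in range(n_s)]
    return V, policy


def policy_value(P, R, gamma, policy, iters=300):
    """Approximate value of a fixed deterministic policy."""
    V = [0.0] * len(P)
    for _ in range(iters):
        V = [q_value(P, R, gamma, V, s, policy[s]) for s in range(len(P))]
    return V


def mc_bounds(counts, R, gamma, s, m, rng):
    """Monte Carlo estimates (lower, upper) of the value at state s.

    Upper: average optimal value of m sampled MDPs.
    Lower: average value, in the same samples, of the policy that is
    optimal for the mean MDP."""
    n_s, n_a = len(counts), len(counts[0])
    mean_P = [[[c / sum(counts[x][a]) for c in counts[x][a]]
               for a in range(n_a)] for x in range(n_s)]
    _, mean_policy = value_iteration(mean_P, R, gamma)
    upper = lower = 0.0
    for _ in range(m):
        P = [[sample_dirichlet(counts[x][a], rng) for a in range(n_a)]
             for x in range(n_s)]
        V_opt, _ = value_iteration(P, R, gamma)
        upper += V_opt[s]
        lower += policy_value(P, R, gamma, mean_policy)[s]
    return lower / m, upper / m
```

Because both estimates are computed on the same sampled MDPs, the lower estimate never exceeds the upper one in this sketch; with independent samples the ordering would only hold in expectation.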

1 Because of the way that the BAMDP $\nu$ is constructed from beliefs over $\mathcal{M}$, the next reward now depends
on the next state rather than the current state and action.

2 In the discrete case, we sample a multinomial distribution from each of the Dirichlet densities
independently for the transitions. For the rewards we draw independent Bernoulli distributions from the Beta of each
state-action pair.

Figure 1: A belief tree, where the rewards are ignored for simplicity, with actions
$\mathcal{A} = \{a^1, a^2\}$ and states $\mathcal{S} = \{s^1, s^2\}$.



2 Complexity of belief tree search 

We now present our main results. Detailed proofs are given in the appendix. We
search trees which arise in the context of planning under uncertainty in MDPs using
the BAMDP framework. We can use value function bounds on the leaf nodes of a
partially expanded BAMDP tree to obtain bounds for the inner nodes through backwards
induction. The bounds can be used both for action selection and for further tree
expansion. However, the bounds are estimated via Monte Carlo sampling, which
necessitates the use of stochastic branch and bound techniques to expand the tree.

We analyse a set of such algorithms. The first is a search to a fixed depth that 
employs exact lower bounds. We then show that if only stochastic bounds are available, 
the complexity of fixed depth search only increases logarithmically. We then present 
two stochastic branch and bound algorithms, whose complexity is dependent on the 
number of near-optimal branches. The first of these uses bound samples on leaf nodes 
only, while the second uses samples obtained in the last half of the parents of leaf 
nodes, thus using the collected samples more efficiently. 

2.1 Assumptions and Notation 

We present the main assumptions concerning the tree search, pointing out the relations
to Bayesian RL. The symbols $V$ and $v$ have been overloaded to make this
correspondence more apparent. The tree has a branching factor at most $\phi$. The branching is
due to both action choices and random outcomes (see Fig. 1). Thus, the nodes at depth
$k$ correspond to the set of hyper-states $\{\omega_{t+k}\}$ in the BAMDP. By abusing notation, we
may also refer to the components of each node $\omega = (s, \xi)$ as $s(\omega)$, $\xi(\omega)$.

We define a branch $b$ as a set of policies (i.e. the set of all policies starting with
a particular action). The value of a branch $b$ is $V^b \triangleq \max_{\pi \in b} V^\pi$. The root branch
is the set of all policies, with value $V^*$. A hyper-state $\omega$ is $b$-reachable if $\exists \pi \in b$
s.t. $P_{\pi,\nu}(\omega | \omega_t) > 0$. Any branch $b$ can be partitioned at any $b$-reachable $\omega$ into a set of
branches $B(b, \omega)$. A possible partition is any $b_i = \{\pi \in b : i = \arg\max_a \pi(a | \omega)\}$ for any
$b_i \in B(b, \omega)$. We simplify this by considering only deterministic policies. We denote
the $k$-horizon value function by $V^b(k) \triangleq \max_{\pi \in b} V^\pi_{t,k}(\omega_t)$. For each tree node $\omega = (s, \xi)$,
we define upper and lower bounds $v_U(\omega) \triangleq E_\xi[V^*_\mu(s)]$, $v_L(\omega) \triangleq E_\xi[V^{\pi^*(\bar\mu_\xi)}_\mu(s)]$, from
(4). By fully expanding the tree to depth $k$ and performing backwards induction (3),
using either $v_U$ or $v_L$ as the value of leaf nodes, we obtain respectively upper and lower
Algorithm 1 Flat oracle search

1: Expand all branches until depth $k = \log_\gamma \epsilon/\beta$ or $\Delta_L > \beta\gamma^k - \epsilon$.
2: Select the root branch $\hat b^* = \arg\max_b V^b_L(k)$.



Algorithm 2 Flat stochastic search

..., $\omega_{\phi^k}\}$ be the set of all $k$-step children of $\omega$
6: end for
7: Calculate $\hat V^b$
8: return $\hat b^* = \arg\max_b \hat V^b$.



bounds $V^b_U(k)$, $V^b_L(k)$ on the value of any branch. Finally, we use $C(\omega)$ for the set of
immediate children of a node $\omega$ and the short-hand $\Omega_k$ for $C^k(\omega)$, the set of all children
of $\omega$ at depth $k$. We assume the following:

Assumption 1 (Uniform linear convergence) There exist $\gamma \in (0, 1)$ and $\beta > 0$ s.t. for
any branch $b$ and depth $k$, $V^b - V^b_L(k) \le \beta\gamma^k$, $V^b_U(k) - V^b \le \beta\gamma^k$.

Remark 1 For BAMDPs with $r_t \in [0, 1]$ and $\gamma < 1$, Ass. 1 holds, from boundedness
and the geometric series, with $\beta = 1/(1 - \gamma)$, since $V^b_L(k)$ and $V^b_U(k)$ are the $k$-horizon
value functions with the value of leaf nodes bounded by $1/(1 - \gamma)$.

We analyse algorithms which search the tree and then select an (action) branch $\hat b^*$.
For each algorithm, we examine the number of leaf node evaluations required to bound
the regret $V^* - V^{\hat b^*}$.

2.2 Flat Search 

With exact bounds, we can expand all branches to a fixed depth and then select the
branch $\hat b^*$ with the highest lower bound. This is Alg. 1, with complexity given by the
following lemma.

Lemma 1 Alg. 1 on a tree with branching factor $\phi$, $\gamma \in (0, 1)$, samples $O(\phi^{1 + \log_\gamma \epsilon/\beta})$
times to bound the regret by $\epsilon$.

Proof Bound the $k$-horizon value function error with Ass. 1 and note that there are
$\phi^{k+1}$ leaves. ■

In our case, we only have a stochastic lower bound on the value of each node.
Algorithm 2 expands the tree to a fixed depth and then takes multiple samples from each
leaf node.

Lemma 2 Calling Alg. 2 with $k = \lceil \log_\gamma \epsilon/2\beta \rceil$, $m = 2 \lceil \log_\gamma(\epsilon/2\beta) \rceil \cdot \log(\phi)$, we bound
the regret by $\epsilon$ using $O(\phi^{1 + \log_\gamma \epsilon/2\beta} \log_\gamma(\epsilon/2\beta) \cdot \log(\phi))$ samples.
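To get a feel for the magnitudes in Lemma 2, the following sketch evaluates the depth, the per-leaf sample count and their product for given parameters. The function name is hypothetical, the logarithm of the branching factor is taken to be natural, and the per-leaf count is rounded up to an integer; these are assumptions of the example.

```python
import math


def flat_search_budget(phi, gamma, beta, eps):
    """Depth k, samples per leaf m, and total samples phi**k * m
    for flat stochastic search with target regret eps."""
    if not (0.0 < gamma < 1.0 and 0.0 < eps < 2.0 * beta):
        raise ValueError("need 0 < gamma < 1 and 0 < eps < 2 * beta")
    # log base gamma of eps / (2 beta)
    log_g = math.log(eps / (2.0 * beta)) / math.log(gamma)
    k = math.ceil(log_g)
    m = math.ceil(2 * k * math.log(phi))
    return k, m, (phi ** k) * m
```

For instance, with $\phi = 2$, $\gamma = 0.5$, $\beta = 1/(1 - \gamma) = 2$ and $\epsilon = 0.4$, the search depth is 4, each of the 16 leaves is sampled 6 times, and the total is 96 samples. The exponential dependence on depth dominates as $\gamma$ approaches 1.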
Algorithm 3 Stochastic branch and bound 1

Let $L_0$ be the root.
for $n = 1, 2, \ldots$ do
    for $\omega \in L_n$ do
        $m_\omega{+}{+}$, $\mu \sim \xi(\omega)$, $\tilde v^\omega_{m_\omega} = V^*_\mu(s(\omega))$.
        $\hat v^\omega_U = \frac{1}{m_\omega} \sum_{i=1}^{m_\omega} \tilde v^\omega_i$
    end for
    $\hat\omega^* = \arg\max_{\omega} \hat v^\omega_U$.
    $L_{n+1} = C(\hat\omega^*) \cup L_n \setminus \hat\omega^*$
end for


Proof The regret now is due to both limited depth and stochasticity. We bound each
by $\epsilon/2$, the first via Lem. 1 and the second via Hoeffding's inequality. ■

Thus, stochasticity mainly adds a logarithmic factor to the oracle search. We now 
consider two algorithms which do not search to a fixed depth, but select branches to 
deepen adaptively. 

2.3 Stochastic Branch and Bound 1 

A stochastic branch and bound algorithm similar to those examined here was originally
developed by [Norkin et al., 1998] for optimisation problems. At each stage, it takes
an additional sample at each leaf node, to improve their upper bound estimates, then
expands the node with the highest mean upper bound. Algorithm 3 uses the same basic
idea, averaging the value function samples at every leaf node.
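The sample-then-expand loop described above can be sketched generically as follows. This is an illustration of the basic idea rather than a transcription of Algorithm 3: the callables `children` and `sample_upper` are hypothetical interfaces supplied by the caller, nodes are assumed hashable, and the loop stops early if the selected leaf has no children.

```python
def stochastic_branch_and_bound(root, children, sample_upper, n_iters):
    """Repeatedly sample an upper bound at every leaf, then expand the
    leaf with the highest mean upper bound.

    children(node)     -> list of child nodes (hypothetical interface)
    sample_upper(node) -> one stochastic upper-bound sample
    Returns the final leaf list and the samples collected per node.
    """
    leaves = [root]
    samples = {root: []}
    for _ in range(n_iters):
        # Take one additional sample at each current leaf.
        for node in leaves:
            samples[node].append(sample_upper(node))
        # Select the leaf with the highest empirical mean.
        best = max(leaves, key=lambda w: sum(samples[w]) / len(samples[w]))
        kids = children(best)
        if not kids:
            break
        # Replace the selected leaf by its children.
        leaves.remove(best)
        for child in kids:
            samples[child] = []
            leaves.append(child)
    return leaves, samples
```

Unexpanded leaves keep accumulating samples across iterations, so their mean estimates tighten even while other parts of the tree are being deepened.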

In order to bound complexity, we need to bound the time required until we discover 
a nearly optimal branch. We calculate the number of times a suboptimal branch is 
expanded before its suboptimality is discovered. Similarly, we calculate the number of 
times we shall sample the optimal node until its mean upper bound becomes dominant. 
These two results cover the time spent sampling upper bounds of nodes in the optimal 
branch without expanding them and the time spent expanding nodes in a sub-optimal 
branch. 

Lemma 3 If $N$ is the (random) number of samples $v_i$ from random variable $V \in [0, \beta]$
we must take until its empirical mean
$\hat V(k) \triangleq \frac{1}{k} \sum_{i=1}^k v_i > E V - \Delta$, then:

$$E[N] \le 1 + \beta^2 \Delta^{-2} \qquad (5)$$
$$P[N > n] \le \exp(-2\beta^{-2} n^2 \Delta^2). \qquad (6)$$

Proof The first inequality follows from the Hoeffding inequality and an integral bound
on the resulting sum, while the second inequality is proven directly via a Hoeffding
bound. ■

By setting A to be the difference between the optimal and second optimal branch, we 
can use the above lemma to bound the number of times the leaf nodes in the optimal 
branch will be sampled without being expanded. The converse problem is bounding 
the number of times that a suboptimal branch will be expanded. 
Algorithm 4 Stochastic branch and bound 2

1: for $\omega \in L_n$ do

2: $\hat V^\omega_U = \frac{1}{\sum_{\omega' \in C(\omega)} m_{\omega'}} \sum_{\omega' \in C(\omega)} \sum_{i=1}^{m_{\omega'}} \tilde v^{\omega'}_i$

3: end for
4: Use (3) to obtain $\hat V_U$ for all nodes.
5: Set $\omega_0$ to root.
6: for $d = 1, \ldots$ do
7:     $a^*_d = \arg\max_a \sum_j \omega_{d-1}(j | a) \hat V_U(j)$
8:     $\omega_d \sim \omega_{d-1}(j | a^*_d)$
9:     if $\omega_d \in L_n$ then
10:        $L_{n+1} = C(\omega_d) \cup L_n \setminus \omega_d$
11:        Break
12:    end if
13: end for

Lemma 4 If $b$ is a branch with $V^b = V^* - \Delta$, then it will be expanded at least to depth
$k_0 = \log_\gamma \Delta/\beta$. Subsequently,

$$P(K > k) < O\left(\exp\{-2\beta^{-2}[(k - k_0)\Delta^2]\}\right). \qquad (7)$$

Proof In the worst case, the branch is degenerate and only one leaf has non-zero
probability. We then apply a Hoeffding bound to obtain the desired result. ■

2.4 Stochastic Branch and Bound 2 

The degeneracy is the main problem of Alg. 3. Alg. 4 not only propagates upper
bounds from multiple leaf nodes to the root, but also re-uses upper bound samples from
inner nodes, in order to handle the degenerate case where only one path has non-zero
probability. (Nevertheless, Lemma 3 applies without modification to Alg. 4.) Because
we are no longer operating on leaf nodes, we can take advantage of the upper bound
samples collected along a given trajectory. However, if we use all of the upper bounds
along a branch, then the early samples may bias our estimates considerably. For this reason, if
a leaf is at depth $k$, we only average the upper bounds along the branch to depth $k/2$.
The complexity of this approach is given by the following lemma:

Lemma 5 If $b$ is s.t. $V^b = V^* - \Delta$, it will be expanded to depth $k_0 > \log_\gamma \Delta/\beta$ and
$P(K > k) < \exp(-2(k - k_0)^2(1 - \gamma^2))$, $k > k_0$.

Proof There is a degenerate case where only one sub-branch has non-zero
probability. However we now re-use the samples that were obtained at previous expansions,
thus allowing us to upper bound the bias by $\frac{\Delta}{k - k_0} \cdot \frac{1 - \gamma^{k+1}}{1 - \gamma}$. This allows us to use a tighter
Hoeffding bound and so obtain the desired outcome. ■

This bound decreases faster with k. Furthermore, there is no dependence on A after the 
initial transitory period, which may however be very long. The gain is due to the fact 
that we are re-using the upper bounds previously obtained in inner nodes. Thus, this 
algorithm should be particularly suitable for stochastic problems. 
2.5 Lower Bounds for Bayesian RL

We can reduce the branching factor $\phi$ (which is $|\mathcal{A} \times \mathcal{S} \times \mathcal{R}|$ for a full search) by
employing sparse sampling methods [Kearns et al., 1999] to $O\{|\mathcal{A}| \exp[1/(1 - \gamma)]\}$.
This was essentially the approach employed by [Wang et al., 2005]. However, our
main focus here is to reduce the depth to which each branch is searched.

The main problem with the above algorithms is the fact that we must reach $k_0 =
\lceil \log_\gamma \Delta \rceil$ to discard $\Delta$-optimal branches. However, since the hyper-state $\omega_t$ arises from
a Bayesian belief, we can use an additional smoothness property:

Lemma 6 The Dirichlet parameter sequence $\psi_t/n_t$, with $n_t \triangleq \sum_{i=1}^K \psi^i_t$, is a $c$-Lipschitz
martingale with $c_t = 1/2(n_t + 1)$.

Proof Simple calculations show that, no matter what is observed, $E_\xi(\psi_{t+1}/n_{t+1}) =
\psi_t/n_t$. Then, we bound the difference $|\psi_{t+k}/n_{t+k} - \psi_t/n_t|$ by two different bounds,
which we equate to obtain $c_t$. ■

Lemma 7 If $\mu, \hat\mu$ are such that $\|T - \hat T\|_\infty \le \epsilon$ and $\|r - \hat r\|_\infty \le \epsilon$, for some $\epsilon > 0$, then
$\|V^\pi - \hat V^\pi\|_\infty \le \frac{\epsilon}{(1 - \gamma)^2}$, for any policy $\pi$.

Proof By subtracting the Bellman equations for $V, \hat V$ and taking the norm, we can
repeatedly apply Cauchy-Schwarz and triangle inequalities to obtain the desired result. ■

The above results help us obtain better lower bounds in two ways. First we note that
initially $1/k$ converges faster than $\gamma^k$, for large $\gamma$, thus we should be able to expand less
deeply. Later, $n_t$ is large so we can sample even more sparsely.

If we search to depth $k$, and the rewards are in $[0, 1]$, then, naively, our error is
bounded by $\sum_{n=k}^\infty \gamma^n = \gamma^k/(1 - \gamma)$. However, the mean MDPs for $n > k$ are close to the
mean MDP at $k$ due to Lem. 6. This means that $\beta$ can be significantly smaller than
$1/(1 - \gamma)$. In fact, the total error is bounded by $\sum_{n=k}^\infty \gamma^n (n - k)/n$. For undiscounted
problems, our error is bounded by $T - k$ in the original case and by $T - k[1 + \log(T/k)]$
when taking into account the smoothness.
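A quick numerical check of the undiscounted case compares the naive bound with the smoothness-based one. The sketch below sums $(n - k)/n$ directly and compares it with the closed-form approximation; the function names are hypothetical and the sum is taken from $n = k$ to $T$ inclusive, which is an assumption about the exact limits.

```python
import math


def naive_error(T, k):
    """Undiscounted error bound without smoothness: T - k."""
    return T - k


def smooth_error(T, k):
    """Undiscounted error bound with smoothness: sum of (n - k) / n."""
    return sum((n - k) / n for n in range(k, T + 1))


def smooth_error_approx(T, k):
    """Closed-form approximation T - k * (1 + log(T / k))."""
    return T - k * (1.0 + math.log(T / k))
```

With $T = 100$ and $k = 10$ the naive bound is 90, while the sum is roughly 67.4 and the closed form roughly 67.0, so the approximation is close and the saving is substantial.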



3 Conclusions and related work 



Much recent work on Bayesian RL focused on myopic estimates or full expansion of
the belief tree up to a certain depth. Exceptions include [Poupart et al., 2006], which
uses an analytical bound based on sampling a small set of beliefs, and [Wang et al.,
2005], which uses Kearns' sparse sampling algorithm [Kearns et al., 1999] to expand
the tree. Both methods have complexity exponential in the horizon, something which
we improve via the use of smoothness properties induced by the Bayesian updating.

There are also connections with work on POMDP problems [Ross et al., 2008].
However this setting, though equivalent in an abstract sense, is not sufficiently close
to the one we consider. Results on bandit problems, employing the same value
function bounds used herein, were reported in [Dimitrakakis, 2008], which experimentally
compared algorithms operating on leaf nodes only.

Related results on the online sample complexity of Bayesian RL were developed by
[Kolter and Ng, 2009], who employ a different upper bound to ours, and [Asmuth et al.,
2009], who employ MDP samples to plan in an augmented MDP space, similarly to
[Auer et al., 2008] (who consider the set of plausible MDPs), and use Bayesian
concentration of measure results [Zhang, 2006] to prove mistake bounds on the online
performance of the algorithm.



Interestingly, Alg. 4 resembles HOO [Bubeck et al., 2008] in the way that it
traverses the tree, with two major differences: (a) the search is adapted to stochastic
trees; (b) we use means of samples of upper bounds, rather than upper bounds
on sample means. For these reasons, we are unable to simply restate the arguments
in [Bubeck et al., 2008].



We presented complexity results and counting arguments for a number of tree
search algorithms on trees where stochastic upper and lower bounds satisfying a
smoothness property exist. These are the first results of this type and partially extend the results
of [Norkin et al., 1998], which provided an asymptotic convergence proof, under
similar smoothness conditions, for a stochastic branch and bound algorithm. In addition,
we introduce a mechanism to utilise samples obtained at inner nodes when calculating
mean upper bounds at leaf nodes. Finally, we relate our complexity results to those
of [Kearns et al., 1999], for whose lower bound we provide a small improvement. We
plan to address the online sample complexity of the proposed algorithms, as well as
their practical performance, in future work.
A Proofs of the main results

Proposition 1 By definition, $V^*(\omega) \ge V^\pi(\omega)$ for all $\omega$, for any policy $\pi$. The lower
bound follows trivially, since

$$V^{\pi^*(\bar\mu_\xi)}(\omega) \triangleq \int V^{\pi^*(\bar\mu_\xi)}_\mu(s)\, \xi(\mu)\, d\mu. \qquad (8)$$

The upper bound is derived as follows. First note that for any function $f$, $\max_x \int f(x, u)\, du \le
\int \max_x f(x, u)\, du$. Then, we remark that:

$$V^*(\omega) = \max_\pi \int V^\pi_\mu(s)\, \xi(\mu)\, d\mu \qquad (9a)$$
$$\le \int \max_\pi V^\pi_\mu(s)\, \xi(\mu)\, d\mu \qquad (9b)$$
$$= \int V^{\pi^*(\mu)}_\mu(s)\, \xi(\mu)\, d\mu. \qquad (9c)$$

■

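Lemma 1 For any $b'$ with $V^{b'}_L \le V^{\hat b^*}_L$, we have: $V^{b'} \le V^{b'}_L + \beta\gamma^k \le V^{\hat b^*}_L + \beta\gamma^k \le V^{\hat b^*} + \beta\gamma^k$.
This holds for $b' = b^*$. Thus, in the worst case, the regret that we suffer if there exists
some $b' : V^{b'} > V^{\hat b^*}$ is $\epsilon = V^* - V^{\hat b^*} \le \beta\gamma^k$. To reach depth $k$ in all branches we need
$n = \sum_{t=1}^k \phi^t < \phi^{k+1}$ expansions. Thus, we require $k = \lceil \log_\gamma(\epsilon/\beta) \rceil$ and $n \le \phi^{1 + \lceil \log_\gamma(\epsilon/\beta) \rceil}$. ■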
Lemma 2 The total number of samples is $\phi^k m$, the number of leaf nodes times the
number of samples at each leaf node. The search is until depth

$$k = \lceil \log_\gamma \epsilon/2\beta \rceil \le 1 + \log_\gamma \epsilon/2\beta \qquad (10)$$

and the number of samples is

$$m = 2 \log_\gamma(\epsilon/2\beta) \log(\phi). \qquad (11)$$

The complexity follows trivially. Now we must prove that this bounds the expected
regret with $\epsilon$. Note that $\beta\gamma^k \le \epsilon/2$, so for all branches $b$:

$$V^b - V^b_L(k) \le \epsilon/2. \qquad (12)$$

The expected regret can now be written as

$$E R \le \frac{\epsilon}{2} + E[R \,|\, \hat V_L < V_L + \epsilon/4]\, P(\hat V_L < V_L + \epsilon/4) \qquad (13)$$
$$\quad + E[R \,|\, \hat V_L \ge V_L + \epsilon/4]\, P(\hat V_L \ge V_L + \epsilon/4). \qquad (14)$$

From the Hoeffding bound (21),

$$P(\hat V_L - V_L > \epsilon/4) \le \exp\left(-\tfrac{1}{8} m \beta^{-2} \gamma^{-2k} \epsilon^2\right)$$

and with a union bound the total error probability is bounded by $\phi^k \exp\left(-\tfrac{1}{8} m \beta^{-2} \gamma^{-2k} \epsilon^2\right)$.
If our estimates are within $\epsilon/4$ then the sample regret is bounded by $\epsilon/4$, while the other
terms are trivially bounded by 1, to obtain

$$E R \le \frac{3\epsilon}{4} + \phi^k \exp\left(-\tfrac{1}{8} m \beta^{-2} \gamma^{-2k} \epsilon^2\right). \qquad (15)$$

Substituting $m$ and $k$, we obtain the stated result.
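Lemma 3

$$E[N] = \sum_{n=1}^{\infty} n \prod_{j=1}^{n-1} P(\hat V(j) > V + \epsilon)\, P(\hat V(n) < V + \epsilon) \qquad (16)$$
$$\le \sum_{n=1}^{\infty} n \exp\left(-2\beta^{-2}\epsilon^2 \sum_{j=1}^{n-1} j\right) = \sum_{n=1}^{\infty} n \exp\left(-\beta^{-2}\epsilon^2 n(n + 1)\right) \qquad (17)$$

Let us now set $\rho = \exp(-\beta^{-2}\epsilon^2)$. Observe that $n\rho^{n(n+1)} < n\rho^{n^2}$, since $\rho < 1$. Then,
note that $\int n\rho^{n^2}\, dn = \frac{\rho^{n^2}}{2 \log \rho}$. So we can bound the sum by

$$\sum_{n=1}^{\infty} n\rho^{n(n+1)} \le 1 + \left[\frac{\rho^{n^2}}{2 \log \rho}\right]_1^{\infty} = 1 + \frac{\exp(-\beta^{-2}\epsilon^2)}{2\beta^{-2}\epsilon^2} < 1 + \beta^2\epsilon^{-2}.$$

This proves the first inequality. For the second inequality, we have:

$$P(N > n) = P\left(\bigwedge_{k=1}^{n} \hat V(k) > V + \epsilon\right) \qquad (18)$$
$$\le \prod_{k=1}^{n} \exp(-2k\beta^{-2}\epsilon^2) \qquad (19)$$
$$= \exp\left(-\beta^{-2}\epsilon^2 n(n + 1)\right) < \exp\left(-n^2\beta^{-2}\epsilon^2\right). \qquad (20)$$

This completes the proof for the first case. The second case is symmetric.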
Lemma 4 In order to stop expanding a sub-optimal branch $b$ at depth $k$, we must have
$\hat V^b_U(k) < V^*$, since in the worst case $\hat V^{b^*}_U(k) = V^*$ for all $k$. Since $V^b = V^* - \Delta$, this only
happens when $k$ is greater than $k_0 \triangleq \lceil \log_\gamma \Delta/\beta \rceil$, which is the minimum depth we must
expand to. Subsequently, we shall note that the probability of stopping is $P(\hat V_U(k) >
\Delta - \beta\gamma^k) \le \exp(-2(\Delta - \beta\gamma^k)^2\beta^{-2})$. We cannot do better due to the degenerate case
where only one leaf node of the branch has non-zero probability.
The probability of not stopping at depth $k$ is bounded by:

$$P(K > k) \le \prod_{j=k_0}^{k} \exp\left(-2(\Delta - \beta\gamma^j)^2\beta^{-2}\right) \le \exp\left(-2\beta^{-2} \sum_{j=k_0}^{k} (\Delta - \beta\gamma^j)^2\right)$$
$$\le \exp\left(-2\beta^{-2}\left[(k - k_0)\Delta^2 + \frac{\beta h}{1 - \gamma^2}\right]\right),$$

$$h = \beta(\gamma^{2k_0} - \gamma^{2(k+1)}) - 2\Delta(\gamma^{k_0} - \gamma^{k+1})(1 + \gamma)$$
$$= \beta(\Delta^2 - \gamma^{2(k+1)}) - 2\Delta(\Delta - \gamma^{k+1})(1 + \gamma).$$

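Lemma 5 Similarly to the previous lemma, there is a degenerate case where only one
sub-branch has non-zero probability. However this algorithm re-uses the samples that
were obtained at previous expansions. When at depth $k$, we average the bounds from
$\lceil k/2 \rceil$ to $k$. Since, in the worst case, we cannot stop until $k > k_0 \triangleq \lceil \log_\gamma \Delta/\beta \rceil$, we shall
bound the probability that we stop at some depth $K > 2k_0$. Then the mean upper bound
bias is at most:

$$h_k = \frac{1}{k - k_0} \sum_{n=k_0}^{k} \beta\gamma^n = \frac{\beta\gamma^{k_0}}{k - k_0} \cdot \frac{1 - \gamma^{k - k_0 + 1}}{1 - \gamma} \le \frac{\Delta}{k - k_0} \cdot \frac{1 - \gamma^{k+1}}{1 - \gamma}.$$

The procedure continues only if the sampling error exceeds $\Delta - h_k$, so it suffices to
bound $P(\hat X_k > \bar X_k + \epsilon)$, where $\hat X_k$ is the average of $\hat V_U(n)$ over $n = \lceil k/2 \rceil, \ldots, k$ and $\bar X_k = V + h_k$, for
$\epsilon = \Delta\left(1 - \frac{1 - \gamma^{k+1}}{(k - k_0)(1 - \gamma)}\right)$:

$$P(\hat X_k > \bar X_k + \epsilon) \le \exp\left(-\frac{2(k - k_0)^2\epsilon^2}{\sum_{i} (\beta\gamma^i)^2}\right).$$

Since $\sum_{i} (\beta\gamma^i)^2 \le \Delta^2 \frac{1 - \gamma^{2(k+1)}}{1 - \gamma^2}$, by setting $\epsilon = \Delta - h_k$ we can bound this by

$$\exp\left(-\frac{2(k - k_0)^2(1 - \gamma^2)}{1 - \gamma^{2(k+1)}} \left(1 - \frac{1 - \gamma^{k+1}}{(k - k_0)(1 - \gamma)}\right)^2\right).$$

For large $k$, this is approximately $O(\exp(-k^2))$. ■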

Lemma 6 It is easy to see that $E(\psi_{t+1}/n_{t+1} | \xi_t) = \psi_t/n_t$. This follows trivially when
no observations are made, since $\psi_{t+1} = \psi_t$. When one observation is made, $n_{t+1} =
1 + n_t$. Then $E(\psi_{t+1}/n_{t+1} | \xi_t) = [\psi_t + (\psi_t/n_t)]/(n_t + 1) = \psi_t(1 + 1/n_t)/(1 + n_t) = \psi_t/n_t$.

Thus, the matrix $T_{\xi_t}$ is a martingale. We shall now prove the Lipschitz property. For
all $k > 0$, $\psi_t > 0$:

$$\psi_t/(n_t + k) \le \psi_{t+k}/n_{t+k} \le (\psi_t + k)/(n_t + k).$$

Note that $|\psi_{t+k}/n_{t+k} - \psi_t/n_t|$ is upper bounded by the gap between $\psi_t/n_t$ and either
end of this interval. Equating the two terms, we obtain $c_t$.
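Lemma 7 The transitions $P, \hat P$ induced by any policy obey $\|P - \hat P\|_\infty \le \epsilon$. By repeated
use of Cauchy-Schwarz and triangle inequalities:

$$\|V - \hat V\|_\infty = \|r - \hat r + \gamma(PV - \hat P \hat V)\|_\infty$$
$$\le \|r - \hat r\|_\infty + \gamma\|PV - \hat P \hat V\|_\infty$$
$$\le \epsilon + \gamma\|PV - (P - \tilde P)\hat V\|_\infty$$
$$\le \epsilon + \gamma\left(\|P(V - \hat V)\|_\infty + \|\tilde P \hat V\|_\infty\right)$$
$$\le \epsilon + \gamma\left(\|P\|_\infty \cdot \|V - \hat V\|_\infty + \|\tilde P\|_\infty \cdot \|\hat V\|_\infty\right)$$
$$\le \epsilon + \gamma\left(\|V - \hat V\|_\infty + \frac{\epsilon}{1 - \gamma}\right),$$

where $\tilde P = P - \hat P$, for which of course $\|\tilde P\|_\infty \le \epsilon$ holds. Solving gives us the required
result. ■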



B Hoeffding bounds for weighted averages 

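Hoeffding bounds can also be derived for weighted averages. Let us first recall the
standard Hoeffding inequality:

Lemma 8 (Hoeffding inequality) If $\hat x_n \triangleq \frac{1}{n} \sum_{i=1}^n x_i$, with $x_i \in [b_i, b_i + h_i]$ drawn from
some arbitrary distribution $f_i$, and $\bar x_n \triangleq \frac{1}{n} \sum_i E[x_i]$, then, for all $\epsilon > 0$:

$$P(\hat x_n \ge \bar x_n + \epsilon) \le \exp\left(-\frac{2n^2\epsilon^2}{\sum_{i=1}^n h_i^2}\right). \qquad (21)$$

We have a weighted sum, $\hat x'_n \triangleq \sum_{i=1}^n w_i x'_i$, $\sum_{i=1}^n w_i = 1$. If we set $v_i \triangleq n w_i$, then we can
write the above as $\frac{1}{n} \sum_{i=1}^n v_i x'_i$. So, if we let $x_i = v_i x'_i$ and assume that $x'_i \in [b, b + h]$, then
$x_i \in [v_i b, v_i(b + h)]$. Substituting into (21) results in

$$P(\hat x'_n \ge \bar x'_n + \epsilon) \le \exp\left(-\frac{2\epsilon^2}{h^2 \sum_i w_i^2}\right). \qquad (22)$$

Furthermore, note that

$$P(\hat x'_n \ge \bar x'_n + \epsilon) \le \exp\left(-\frac{2\epsilon^2}{h^2}\right), \qquad (23)$$

since $w_i^2 \le w_i$ for all $i$, as $w_i \in [0, 1]$. Thus $\sum_i w_i^2 \le \sum_i w_i = 1$. Note that $\sum_i w_i^2 = 1$ iff
$w_j = 1$ for some $j$.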
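The two weighted bounds can be compared numerically with a short sketch. The function names are hypothetical; the first evaluates the bound that depends on the sum of squared weights and the second the looser weight-free bound.

```python
import math


def weighted_hoeffding_bound(weights, h, eps):
    """Tail bound exp(-2 eps^2 / (h^2 sum_i w_i^2)) for a weighted
    average of variables with common range width h."""
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    sum_sq = sum(w * w for w in weights)
    return math.exp(-2.0 * eps ** 2 / (h ** 2 * sum_sq))


def loose_hoeffding_bound(h, eps):
    """Weight-free tail bound exp(-2 eps^2 / h^2)."""
    return math.exp(-2.0 * eps ** 2 / h ** 2)
```

With uniform weights $w_i = 1/n$ the first bound coincides with the standard inequality, $\exp(-2n\epsilon^2/h^2)$, and when a single weight equals 1 it collapses to the second bound, matching the remark above.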